# Load the libraries we will need
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(plyr)
#library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x purrr::lift()      masks caret::lift()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggcorrplot)
library(knitr)
library(splines)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0-2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(RColorBrewer)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.1.8
library(arules) # for Association Rules analysis
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(tree)
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(arulesViz)
library(DMwR)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
## 
##     join
# Define the colour palettes used throughout the report
nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "BrBG"))(nb.cols)
cookiecol <- c('#ad6a1d','#9a5327','#cc8d4a','#4e1703','#ecc78d','#2e0a05','#d0dfe4','#33575b','#173742')

1 Background

The main dataset is the transaction record of a bakery (https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery). Though not specified in the dataset description, it is implied that the bakery is located in the Old Town of Edinburgh, UK. The dataset was downloaded from Kaggle and is stored in the same directory as this R Markdown file.

To supplement the dataset with more features, we extracted historical weather records from Meteostat (https://dev.meteostat.net/python/hourly.html#response-parameters). We use the weather data recorded by a station at Edinburgh Airport, which is about 8 miles from the Edinburgh Old Town. This dataset was also downloaded and stored in the same directory as this R Markdown file.

2 Questions To Be Answered

Combining the bakery transaction record with the historical weather record, there are a few questions we can explore, such as:

  1. Given the date, hour, and weather forecast, can we predict the number of transactions during that hour?

  2. Which items tend to be purchased together?

  3. Do the transactions form distinct clusters?

3 Data Cleaning and Wrangling

3.1 Read The Dataset

First, we read the bakery transaction record and the hourly weather record for Edinburgh.

df_ori <- read.csv(file = 'BreadBasket_DMS.csv',
               header = TRUE,
               encoding = 'utf-8')

df_weather_ori <- read.csv(file = 'Edinburgh_weather_hourly.csv',
                       header = TRUE,
                       encoding = 'utf-8')

Then we display the first few rows of each dataset.

head(df_ori)
##         Date     Time Transaction          Item
## 1 2016-10-30 09:58:11           1         Bread
## 2 2016-10-30 10:05:34           2  Scandinavian
## 3 2016-10-30 10:05:34           2  Scandinavian
## 4 2016-10-30 10:07:57           3 Hot chocolate
## 5 2016-10-30 10:07:57           3           Jam
## 6 2016-10-30 10:07:57           3       Cookies
head(df_weather_ori)
##            time temp dwpt rhum prcp snow wdir wspd wpgt pres tsun coco station
## 1 1/1/2016 0:00    2  0.1   87   NA   NA   NA   NA   NA 1013   NA   NA    3160
## 2 1/1/2016 1:00    2  0.1   87   NA   NA  240 22.3   NA 1014   NA   NA    3160
## 3 1/1/2016 2:00    2  0.1   87   NA   NA  230 25.9   NA 1015   NA   NA    3160
## 4 1/1/2016 3:00    3 -1.0   75   NA   NA  230 20.5   NA 1015   NA   NA    3160
## 5 1/1/2016 4:00    2 -2.0   75   NA   NA  240  9.4   NA 1016   NA   NA    3160
## 6 1/1/2016 5:00    2 -2.0   75   NA   NA   NA  3.6   NA 1016   NA   NA    3160

We will create keys in order to merge the two datasets.

df <- df_ori
df_weather <- df_weather_ori
df$Date <- as.Date(df$Date, "%Y-%m-%d")
df$Hour <- as.numeric(substr(df$Time, 1, 2))
df$key <- paste(as.character(df$Date), "@", as.character(df$Hour))
df_weather <- df_weather %>% separate(time, c("Date", "Hour_Minute"), " ")
df_weather <- df_weather %>% separate(Hour_Minute, c("Hour", "Minute"), ":")
df_weather$Date <- as.Date(df_weather$Date, "%m/%d/%Y")
df_weather$Hour <- as.numeric(df_weather$Hour)
df_weather$key <- paste(as.character(df_weather$Date), "@", as.character(df_weather$Hour))
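As a quick reference, the key format produced above can be sketched on a toy value (the date and hour below are illustrative only, not taken from the data):

```r
# Hedged sketch of the join-key format built above ("YYYY-MM-DD @ H").
# The date and hour are toy values for illustration.
d <- as.Date("2016-10-30")
h <- 9
key <- paste(as.character(d), "@", as.character(h))
key  # "2016-10-30 @ 9"
```

Because the hour passes through `as.numeric()`, it renders without a leading zero ("9", not "09"); both tables build the key the same way, which is what lets the later merge match.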

3.2 The Bakery Transaction Record

3.2.1 NA Values and Duplicated Rows

Now we will deal with data types and missing values, if any, starting with the bakery transaction record.

summary(df)
##       Date                  Time        Transaction        Item     
##  Min.   :2016-10-30   12:07:39:   16   Min.   :   1   Coffee :5471  
##  1st Qu.:2016-12-03   10:45:21:   13   1st Qu.:2548   Bread  :3325  
##  Median :2017-01-21   10:55:19:   13   Median :5067   Tea    :1435  
##  Mean   :2017-01-17   14:38:01:   13   Mean   :4952   Cake   :1025  
##  3rd Qu.:2017-02-28   13:43:08:   12   3rd Qu.:7329   Pastry : 856  
##  Max.   :2017-04-09   14:19:47:   12   Max.   :9684   NONE   : 786  
##                       (Other) :21214                  (Other):8395  
##       Hour           key           
##  Min.   : 1.00   Length:21293      
##  1st Qu.:10.00   Class :character  
##  Median :12.00   Mode  :character  
##  Mean   :12.25                     
##  3rd Qu.:14.00                     
##  Max.   :23.00                     
## 

There are no NAs. Next, we verify whether there are duplicated rows in the dataset.

dim(df[duplicated(df), ])[1]
## [1] 1653

There are duplicated rows. Since our analysis will not focus on the quantities of items sold, it is safe to remove them.

df <- distinct(df)

3.2.2 Weekday vs. Weekend

Given the nature of the bakery business, we may expect different behaviors during weekdays and weekends.

df$Weekday <- weekdays(df$Date, abbreviate = TRUE)

df$Weekday <- factor(df$Weekday, 
                        levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))


res <- ddply(df, ~Weekday, summarise, No_of_transaction = length(unique(Transaction)))

ggplot(data = res, mapping = aes(x = Weekday, y = No_of_transaction, fill = Weekday)) +
  geom_bar(stat = 'identity',width=0.7,alpha=0.8) +
  labs(title = 'No. of Transactions by Weekdays', x = 'Weekdays', y = 'Number of Transactions') +
  scale_fill_brewer(palette = "BrBG") +
  theme_minimal()

As expected, there are more transactions on Saturday. However, the number of transactions on Sunday seems low. To understand why, we will look at the number of transactions by hour for each weekday.

res <- ddply(df, .(Weekday, Hour), summarise, No_of_transaction = length(unique(Transaction)))

ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = "identity",alpha=0.8) + facet_wrap(~ Weekday) +
  labs(title = 'No. of Transactions by Hour by Weekdays', x = 'Hours', y = 'Number of Transactions') +
  scale_fill_manual(values = mycolors) +
  theme_minimal() + 
  theme(legend.position = "none",
axis.text = element_text(size=8),
)

Looking at the distribution of transactions by hour, we notice that the trends for Saturday and Sunday are similar to each other and different from those of the other days. Therefore, it makes sense to group Saturday and Sunday together, even though Sunday has fewer transactions than Saturday. This also suggests splitting the dataset into weekday and weekend subsets for further analysis.

df$Weekend <- mapvalues(df$Weekday,
                        from = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'),
                        to = c(0, 0, 0, 0, 0, 1, 1))

3.2.3 Hours of The Day

From the figure presented in 3.2.2, we notice a few transactions at abnormal hours, such as 1 am and 11 pm.

res <- ddply(df, ~Hour, summarise, No_of_transaction = length(unique(Transaction)))
ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'No. of Transactions by Hour', x = 'Hours', y = 'Number of Transactions') +
  scale_fill_manual(values = mycolors) +
  theme_minimal() +
  theme(legend.position = "none")

Considering the small number of transactions at abnormal hours, we will drop the rows whose hour falls outside normal business operating time. In other words, we drop rows whose hour is 1, 21, 22, or 23.

df <- df%>%filter(Hour<21&Hour>1)

Another observation is that the number of transactions per hour varies with the time of day. Here we split the day into two segments: rush hours from 9 to 15 and non-rush hours for the rest of the day. This split makes sense practically, as the period from 9 to 15 covers breakfast, lunch, and afternoon coffee or tea time.

df$Rush_hours <- mapvalues(df$Hour,
                        from = c(7, 8,
                                 9, 10, 11, 12, 13, 14, 15,
                                 16, 17, 18, 19, 20),
                        to = c(0, 0,
                               1, 1, 1, 1, 1, 1, 1,
                               0, 0, 0, 0, 0))
df$Rush_hours<-as.factor(df$Rush_hours)

3.2.4 Holidays

We also want to take a look at sales during holidays, focusing on Christmas and New Year. The Holiday flag is coded 0 = non-holiday and 1 = Christmas/New Year period (12/24/2016 - 1/1/2017). Note that the bakery did not open on 12/25/2016, 12/26/2016, or 1/1/2017.

df$Holiday<-0
df$Holiday[df$Date>'2016-12-23'&df$Date<'2017-01-02']<-1
hol<-df%>%group_by(Holiday)%>%summarize(mean_transaction = length(unique(Transaction))/length(unique(Date)))
## `summarise()` ungrouping output (override with `.groups` argument)
hol%>%ggplot(aes(as.factor(Holiday),mean_transaction, fill = as.factor(Holiday),label = round(mean_transaction,2))) +
  labs(title = 'Average No. of Transactions by Holiday', x = 'Holiday', y = 'Average Number of Transactions') +
  geom_bar(stat = 'identity', width = 0.5,alpha=0.8) +
  geom_text(vjust = -0.5) +
  scale_fill_manual(values = cookiecol[c(5,7)]) +
  theme_minimal()

df$Holiday<-as.factor(df$Holiday)

Indeed, the average number of transactions per day during the holidays is about 12% lower than on non-holidays.

3.2.5 List of Purchased Items

Next we focus on the “Item” column to understand what it contains.

Item_tb <- table(df$Item)
sort(Item_tb, decreasing = TRUE)
## 
##                        Coffee                         Bread 
##                          4528                          3096 
##                           Tea                          Cake 
##                          1350                           983 
##                        Pastry                          NONE 
##                           815                           753 
##                      Sandwich                     Medialuna 
##                           680                           585 
##                 Hot chocolate                       Cookies 
##                           551                           515 
##                       Brownie                    Farm House 
##                           379                           371 
##                         Juice                        Muffin 
##                           364                           364 
##                     Alfajores                         Scone 
##                           344                           327 
##                          Soup                         Toast 
##                           326                           318 
##                  Scandinavian                      Truffles 
##                           274                           192 
##                          Coke                Spanish Brunch 
##                           184                           172 
##                      Baguette                        Tiffin 
##                           152                           146 
##                         Fudge                           Jam 
##                           142                           142 
##                 Mineral water                Jammie Dodgers 
##                           133                           125 
##                  Chicken Stew             Hearty & Seasonal 
##                           123                           100 
##                         Salad                      Frittata 
##                            99                            81 
##                     Smoothies              Keeping It Local 
##                            77                            63 
##                     The Nomad                      Focaccia 
##                            58                            54 
##                Vegan mincepie                      Bakewell 
##                            52                            48 
##                       Tartine      Afternoon with the baker 
##                            46                            43 
##                      Art Tray          Extra Salami or Feta 
##                            38                            38 
##                          Eggs                       Granola 
##                            28                            28 
##                        Tshirt              My-5 Fruit Shoot 
##                            21                            18 
##        Ella's Kitchen Pouches                        Crisps 
##                            17                            14 
##                Dulce de Leche                      Duck egg 
##                            13                            12 
##                  Kids biscuit            Pick and Mix Bowls 
##                            12                            12 
##              Christmas common                Mighty Protein 
##                            11                            11 
##                  Tacos/Fajita              Valentine's card 
##                            11                            11 
##                      Postcard                    Chocolates 
##                            10                             9 
##             Gingerbread syrup                   Vegan Feast 
##                             9                             9 
##    Drinking chocolate spoons                         Muesli 
##                             8                             8 
##                     Nomad bag               Argentina Night 
##                             8                             7 
##              Coffee granules                      Empanadas 
##                             7                             7 
##              Victorian Sponge                        Basket 
##                             7                             6 
##                        Crepes           Half slice Monster  
##                             6                             6 
##                         Honey             Lemon and coconut 
##                             6                             6 
##                       Pintxos                  Bare Popcorn 
##                             6                             5 
##                      Mortimer                      Panatone 
##                             5                             5 
##                 Bread Pudding            Brioche and salami 
##                             4                             3 
##                 Caramel bites         Cherry me Dried fruit 
##                             3                             3 
## Raspberry shortbread sandwich                 Bowl Nic Pitt 
##                             3                             2 
##               Chimichurri Oil                   Fairy Doors 
##                             2                             2 
##                Hack the stack                      Siblings 
##                             2                             2 
##                        Spread                    Adjustment 
##                             2                             1 
##                         Bacon                  Chicken sand 
##                             1                             1 
##                  Gift voucher                Olum & polenta 
##                             1                             1 
##                       Polenta                      Raw bars 
##                             1                             1 
##                      The BART 
##                             1

This list of items seems inconsistent and confusing. For example, there are items named “NONE”. Also, Brownie is separated from Cake, Baguette is not counted as Bread, and Medialuna is treated differently from Pastry.

The item types “Adjustment” and “NONE” were probably introduced by the transaction tracking system or the cashier; in other words, there is no real purchase behind them, so we drop them from the dataset. A new ‘Item_Type’ column is then created and coded as follows:

Bread = 1

Cookies = 2

Cake|Pastry|Sweets = 3

Coffee = 4

Tea = 5

Hot chocolate|Smoothie|Juice = 6

Other beverage = 7

Meal = 8

Other = 9

*Ambiguous items are coded as ‘Other’.

df <- df %>% filter(Item!='NONE' & Item!='Adjustment')
df$Item_Type <- 9
df$Item_Type[df$Item%in%c('Bread', 'Farm House', 'Toast','Baguette','Focaccia')]<-1
df$Item_Type[df$Item == 'Cookies']<-2
df$Item_Type[df$Item%in%c('Cake','Pastry','Medialuna','Brownie','Muffin','Alfajores','Scone','Scandinavian','Truffles','Tiffin','Fudge','Jammie Dodgers','Bakewell','Tartine','Vegan mincepie')]<-3
df$Item_Type[df$Item == 'Coffee']<-4
df$Item_Type[df$Item == 'Tea']<-5
df$Item_Type[df$Item%in%c('Hot chocolate', 'Juice', 'Smoothies')]<-6
df$Item_Type[df$Item%in%c('Mineral water', 'Coke')]<-7
df$Item_Type[df$Item%in%c('Sandwich', 'Soup', 'Spanish Brunch', 'Chicken Stew', 'Salad','Frittata')]<-8
df$Item_Type<-as.factor(df$Item_Type)
df%>%group_by(Item_Type)%>%
  summarize(Count = n()) %>%
  ggplot(aes(x=Item_Type,y=Count,fill=Item_Type,label = Count)) +
  geom_bar(stat="identity", width=0.7,alpha=0.7) +
  theme_minimal() + 
  scale_fill_manual(values = cookiecol,labels = c("1:Bread", "2:Cookies", "3:Cake|Pastry|Sweets",'4:Coffee','5:Tea','6:Hot chocolate|Smoothie|Juice','7:Other beverage','8:Meal','9:Other')) +
  geom_text(vjust = -0.5,size=3) + 
  labs(title = 'Item Frequency in Unique Transactions', x = 'Item Type', y = 'Number of Transactions') 
## `summarise()` ungrouping output (override with `.groups` argument)

From the Figure above, cakes/pastries/sweets are the most popular items in the bakery, followed by Coffee and Bread.

We also want to see whether the transactions of each item type vary by hour of the day. We decided to focus on bread, cookies, cakes/pastries/sweets, coffee, and tea.

bread<-df%>%filter(Item_Type==1)%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Bread by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)


cookie<-df%>%filter(Item_Type==2)%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cookies by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

pastry<-df%>%filter(Item_Type==3)%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cakes/Pastries/Sweets by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

coffee<-df%>%filter(Item_Type==4)%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Coffee by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

tea<-df%>%filter(Item_Type==5)%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Tea by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

grid.arrange(bread,cookie,pastry,coffee,tea, nrow=3,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours: Bread, Cookies, Pastries, Coffee, and Tea", gp = gpar(fontsize = 14, font = 3)))

The figure above shows that bread and coffee tend to sell more in the morning (until ~11 am), while tea tends to sell more in the afternoon. Transaction frequencies of cookies and cakes/pastries/sweets peak in both the morning and the afternoon.

3.3 The Edinburgh Weather Record

Let’s start with a high-level summary of the dataset.

summary(df_weather)
##       Date                 Hour          Minute               temp       
##  Min.   :2016-01-01   Min.   : 0.00   Length:17544       Min.   :-8.000  
##  1st Qu.:2016-07-01   1st Qu.: 5.00   Class :character   1st Qu.: 5.000  
##  Median :2016-12-31   Median :11.00   Mode  :character   Median :10.000  
##  Mean   :2016-12-31   Mean   :11.50                      Mean   : 9.165  
##  3rd Qu.:2017-07-02   3rd Qu.:17.25                      3rd Qu.:13.000  
##  Max.   :2018-01-01   Max.   :23.00                      Max.   :28.000  
##                                                                          
##       dwpt             rhum          prcp           snow        
##  Min.   :-8.900   Min.   : 29.00   Mode:logical   Mode:logical  
##  1st Qu.: 2.100   1st Qu.: 76.00   NA's:17544     NA's:17544    
##  Median : 6.900   Median : 86.00                                
##  Mean   : 6.165   Mean   : 82.56                                
##  3rd Qu.:10.000   3rd Qu.: 93.00                                
##  Max.   :18.000   Max.   :100.00                                
##                                                                 
##       wdir            wspd         wpgt              pres        tsun        
##  Min.   : 10.0   Min.   : 0.00   Mode:logical   Min.   : 965   Mode:logical  
##  1st Qu.:130.0   1st Qu.: 7.60   NA's:17544     1st Qu.:1006   NA's:17544    
##  Median :240.0   Median :13.00                  Median :1014                 
##  Mean   :200.5   Mean   :15.04                  Mean   :1013                 
##  3rd Qu.:250.0   3rd Qu.:20.50                  3rd Qu.:1020                 
##  Max.   :360.0   Max.   :66.60                  Max.   :1040                 
##  NA's   :1140    NA's   :28                     NA's   :824                  
##       coco           station         key           
##  Min.   : 5.000   Min.   :3160   Length:17544      
##  1st Qu.: 7.000   1st Qu.:3160   Class :character  
##  Median : 7.000   Median :3160   Mode  :character  
##  Mean   : 7.139   Mean   :3160                     
##  3rd Qu.: 7.000   3rd Qu.:3160                     
##  Max.   :25.000   Max.   :3160                     
##  NA's   :14714

There are NA values in some of the columns. The columns “prcp”, “snow”, “wpgt” and “tsun” contain only NA values, so we will drop them.

The Meteostat website gives a clear explanation of each column (https://dev.meteostat.net/python/hourly.html#response-parameters):

  1. station: The Meteo ID of the weather station

  2. temp: The air temperature in C

  3. dwpt: The dew point in C

  4. rhum: The relative humidity in percent

  5. wdir: The average wind direction in degrees

  6. wspd: The average wind speed in km/h

  7. pres: The average sea-level air pressure in hPa

  8. coco: The weather condition code (https://dev.meteostat.net/docs/formats.html#weather-condition-codes). Only significant weather events are reported here.

Based on the description of each column, we can drop the “station” column, as it is the same for all observations. We can also drop the “dwpt” column, as we already include “temp” among our features. Another variable to drop is “pres”, as air pressure is difficult to interpret without a background in meteorology. For the missing values in “coco”, we will fill in 0, which stands for a non-significant weather event.

We will also remove the “Minute” column as it is always 00.

df_weather <- subset(df_weather,
                     select = -c(Minute, prcp, snow, wpgt, tsun, station, dwpt, pres))

For the missing values in “wdir” and “wspd”, and given that weather conditions change gradually, we will fill each gap with the value of the hour right after it.

df_weather$coco[is.na(df_weather$coco)] <- 0

df_weather <- df_weather %>% tidyr::fill(wdir, .direction = "up")
df_weather <- df_weather %>% tidyr::fill(wspd, .direction = "up")
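To make the fill direction concrete, here is a minimal sketch on toy data (the numbers are made up): `.direction = "up"` replaces each NA with the next non-missing value below it, i.e. the reading from the following hour.

```r
library(tidyr)

# Toy wind-speed column with gaps; the numbers are illustrative only
toy <- data.frame(Hour = 7:10, wspd = c(NA, 12, NA, 18))

# .direction = "up" fills each NA from the next non-missing row below it
fill(toy, wspd, .direction = "up")$wspd  # 12 12 18 18
```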

3.4 The Combined Dataset

We will now merge the two datasets on the pre-defined keys.

df_merged <- merge(x = df, y = df_weather, by = 'key', all.x = TRUE)
df_merged$Date <- df_merged$Date.x
df_merged$Hour <- df_merged$Hour.x
df_merged <- subset(df_merged,
                    select = -c(key, Time, Date.y, Hour.y, Date.x, Hour.x))
df_merged$coco<-as.factor(df_merged$coco)
df_merged_feature <- subset(df_merged, select = -c(Transaction, Item, Weekday,Weekend, Item_Type, Date,Holiday,coco,Rush_hours))
# Calculate the correlation matrix of the numeric features
corr <- round(cor(df_merged_feature), digits = 2)

#Use ggcorrplot to graph correlation. Only plot the lower triangle of the correlation matrix.
ggcorrplot(corr, type = "lower", 
           ggtheme = ggplot2::theme_minimal,
           lab = TRUE,
           colors = c(cookiecol[5],'white',cookiecol[7]))

The correlation plot suggests a possible correlation between temperature and humidity, and between temperature and wind speed.
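As a reminder of what each cell represents, here is a simulated sketch (all numbers made up, not the report's data): `cor()` returns the Pearson coefficient, which is negative when one variable tends to fall as the other rises.

```r
# Simulated illustration of a negative Pearson correlation, as the plot
# suggests for temperature vs. humidity; the data here are synthetic
set.seed(1)
x <- rnorm(100)                       # a temperature-like variable
y <- -0.5 * x + rnorm(100, sd = 0.5)  # constructed to co-vary negatively with x
cor(x, y)  # negative
```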

3.5 Dataset Construction

3.5.1 Predict Transactions By Day of Week, Hour, and Weather Information

# To build the dataset for the regression problem
# Keep Hour as a feature
df_merged_reg<-df_merged%>%
  group_by(Date, Hour)%>%
  mutate(No_of_transaction = length(unique(Transaction)))%>%
  ungroup()%>%
  #select(-c(Transaction, Item, Item_Type,Date,Hour))%>%
  select(-c(Transaction, Item, Item_Type,Date))%>%
  unique()%>%
  data.frame()

# Keep Weekday as a feature
df_weekday<-df_merged_reg%>%
  filter(Weekend==0)%>%
  #select(-Weekday,-Weekend)%>%
  select(-Weekend)%>%
  data.frame()

summary(df_weekday)
##  Weekday   Rush_hours Holiday       temp            rhum             wdir      
##  Mon:201   0:308      0:1065   Min.   :-6.00   Min.   : 47.00   Min.   : 10.0  
##  Tue:225   1:790      1:  33   1st Qu.: 5.00   1st Qu.: 71.25   1st Qu.:200.0  
##  Wed:222                       Median : 7.00   Median : 81.00   Median :240.0  
##  Thu:228                       Mean   : 6.85   Mean   : 80.01   Mean   :212.5  
##  Fri:222                       3rd Qu.: 9.00   3rd Qu.: 87.00   3rd Qu.:260.0  
##  Sat:  0                       Max.   :15.00   Max.   :100.00   Max.   :360.0  
##  Sun:  0                                                                       
##       wspd       coco          Hour       No_of_transaction
##  Min.   : 0.00   0 :978   Min.   : 7.00   Min.   : 1.000   
##  1st Qu.: 9.40   5 : 20   1st Qu.:10.00   1st Qu.: 3.000   
##  Median :14.80   7 : 84   Median :12.00   Median : 5.000   
##  Mean   :16.84   8 :  2   Mean   :12.42   Mean   : 5.593   
##  3rd Qu.:24.10   12:  5   3rd Qu.:15.00   3rd Qu.: 8.000   
##  Max.   :55.40   17:  9   Max.   :20.00   Max.   :17.000   
## 
# Keep Weekday as a feature
df_weekend<-df_merged_reg%>%
  filter(Weekend==1)%>%
  #select(-Weekday,-Weekend)%>%
  select(-Weekend)%>%
  data.frame()

summary(df_weekend)
##  Weekday   Rush_hours Holiday      temp             rhum             wdir      
##  Mon:  0   0: 83      0:378   Min.   :-2.000   Min.   : 31.00   Min.   : 10.0  
##  Tue:  0   1:313      1: 18   1st Qu.: 4.000   1st Qu.: 76.00   1st Qu.:220.0  
##  Wed:  0                      Median : 7.000   Median : 82.00   Median :240.0  
##  Thu:  0                      Mean   : 7.452   Mean   : 81.41   Mean   :218.7  
##  Fri:  0                      3rd Qu.:10.000   3rd Qu.: 93.00   3rd Qu.:260.0  
##  Sat:226                      Max.   :16.000   Max.   :100.00   Max.   :360.0  
##  Sun:170                                                                       
##       wspd       coco          Hour       No_of_transaction
##  Min.   : 0.00   0 :333   Min.   : 7.00   Min.   : 1.000   
##  1st Qu.: 9.40   5 : 10   1st Qu.:10.00   1st Qu.: 5.000   
##  Median :14.80   7 : 48   Median :12.00   Median : 8.000   
##  Mean   :16.44   8 :  0   Mean   :12.44   Mean   : 8.359   
##  3rd Qu.:22.30   12:  1   3rd Qu.:15.00   3rd Qu.:11.000   
##  Max.   :48.20   17:  4   Max.   :20.00   Max.   :25.000   
## 

Next, we plot scatter plots of the number of transactions against each weather feature.

# Scatter plots of the number of transactions against each weather feature
temp<-df_merged_reg%>%ggplot(aes(temp,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Temperature in Celsius', y='Number of Transactions') + 
  theme_minimal()

ws<-df_merged_reg%>%ggplot(aes(wspd,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Wind Speed', y='Number of Transactions') + 
  theme_minimal()

hum<-df_merged_reg%>%ggplot(aes(rhum,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Humidity', y='Number of Transactions') + 
  theme_minimal()

wcc<-df_merged_reg%>%ggplot(aes(coco,No_of_transaction)) + 
  geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
  labs(x='Weather Condition Code', y='Number of Transactions') + 
  theme_minimal()

grid.arrange(temp,ws,hum,wcc, nrow = 2)

3.5.2 Market Basket Analysis

Here we will construct the dataset to perform market basket analysis at a later stage.

df_merged_mba <- df_merged

# Map the item-type codes back to readable names
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 1] <- 'Bread'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 2] <- 'Cookies'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 3] <- 'Cake|Pastry|Sweets'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 4] <- 'Coffee'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 5] <- 'Tea'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 6] <- 'Hot chocolate|Smoothie|Juice'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 7] <- 'Other beverage'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 8] <- 'Meal'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 9] <- 'Other'


df_merged_mba_item_type_temp <- subset(df_merged_mba,
                                       select = c(Transaction, Item_Type_Name, Weekend))

df_merged_mba_item_type_temp <- distinct(df_merged_mba_item_type_temp)

# Remove transactions that contain only one item type
df_merged_mba_item_type_temp <-
  df_merged_mba_item_type_temp %>%
  group_by(Transaction) %>%
  mutate(freq = n()) %>%
  data.frame()

df_merged_mba_item_type_temp <- df_merged_mba_item_type_temp[df_merged_mba_item_type_temp$freq > 1, ]


# MBA on category of items
df_merged_mba_item_type <- df_merged_mba_item_type_temp %>%
  group_by(Transaction) %>%
  summarise(basket = as.vector(list(Item_Type_Name)))
## `summarise()` ungrouping output (override with `.groups` argument)
# MBA on items
df_merged_mba_item <- df_merged_mba %>%
  group_by(Transaction) %>%
  mutate(basket = as.vector(list(Item)))%>%ungroup()%>%data.frame()

mba_wkday<-df_merged_mba_item%>%
  filter(Weekend==0)%>%
  select(-Weekend)

mba_wkend<-df_merged_mba_item%>%
  filter(Weekend==1)%>%
  select(-Weekend)

3.5.3 Study The Clusters of Transactions

Here we prepare the dataset for the clustering analysis: we drop the identifier columns (transaction ID, item, weekday, and date) and keep the remaining distinct rows.

df_merged_cluster<-df_merged%>%
  #group_by(Transaction)%>%
  #mutate(No_Item_Type = length(unique(Item)))%>%
  #ungroup()%>%
  select(-Transaction,-Item, -Weekday,-Date)%>%
  unique()%>%
  data.frame()

df_merged_cluster[,c(2:10)]<-sapply(df_merged_cluster[,c(2:10)], as.numeric)

cluster_wkday<-df_merged_cluster%>%filter(Weekend==0)%>%select(-Weekend)%>%data.frame()
cluster_wkend<-df_merged_cluster%>%filter(Weekend==1)%>%select(-Weekend)%>%data.frame()


summary(cluster_wkday)
##    Rush_hours       Holiday        Item_Type          temp       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :-6.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 5.000  
##  Median :2.000   Median :1.000   Median :4.000   Median : 7.000  
##  Mean   :1.805   Mean   :1.029   Mean   :4.196   Mean   : 7.063  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:6.000   3rd Qu.:10.000  
##  Max.   :2.000   Max.   :2.000   Max.   :9.000   Max.   :15.000  
##       rhum             wdir            wspd            coco      
##  Min.   : 47.00   Min.   : 10.0   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 71.00   1st Qu.:200.0   1st Qu.: 9.40   1st Qu.:1.000  
##  Median : 81.00   Median :240.0   Median :14.80   Median :1.000  
##  Mean   : 79.24   Mean   :212.6   Mean   :17.02   Mean   :1.232  
##  3rd Qu.: 87.00   3rd Qu.:260.0   3rd Qu.:24.10   3rd Qu.:1.000  
##  Max.   :100.00   Max.   :360.0   Max.   :55.40   Max.   :6.000  
##       Hour     
##  Min.   : 7.0  
##  1st Qu.:10.0  
##  Median :13.0  
##  Mean   :12.5  
##  3rd Qu.:15.0  
##  Max.   :20.0
summary(cluster_wkend)
##    Rush_hours       Holiday        Item_Type          temp      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :-2.00  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 5.00  
##  Median :2.000   Median :1.000   Median :4.000   Median : 8.00  
##  Mean   :1.855   Mean   :1.042   Mean   :4.529   Mean   : 7.63  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:6.000   3rd Qu.:10.00  
##  Max.   :2.000   Max.   :2.000   Max.   :9.000   Max.   :16.00  
##       rhum             wdir            wspd            coco      
##  Min.   : 31.00   Min.   : 10.0   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 76.00   1st Qu.:220.0   1st Qu.: 9.40   1st Qu.:1.000  
##  Median : 82.00   Median :240.0   Median :14.80   Median :1.000  
##  Mean   : 80.86   Mean   :216.1   Mean   :16.35   Mean   :1.328  
##  3rd Qu.: 93.00   3rd Qu.:260.0   3rd Qu.:22.30   3rd Qu.:1.000  
##  Max.   :100.00   Max.   :360.0   Max.   :48.20   Max.   :6.000  
##       Hour      
##  Min.   : 7.00  
##  1st Qu.:11.00  
##  Median :12.00  
##  Mean   :12.47  
##  3rd Qu.:14.00  
##  Max.   :20.00

4 Modeling and Analysis

4.1 Predicting The Number of Transactions

The task of this analysis is to predict the number of transactions during a specific hour, given the date, the hour, and the weather forecast. More importantly, from the model created, we expect to further understand how sales are driven by external variables such as weather, day of the week, and time of the day.

First, we split the dataset into a training set and a validation set.

set.seed(0)

train_wkday<-sample(1:nrow(df_weekday), round(0.8*dim(df_weekday)[1]))
train_wkend <- sample(1:nrow(df_weekend), round(0.8*dim(df_weekend)[1]))
num_features <- dim(df_weekday)[2]

4.1.1 Lasso Regression Model

We will start with a lasso regression model.

Weekday

x_wkday <- model.matrix(No_of_transaction~.,df_weekday)[,-1]
y_wkday <- df_weekday$No_of_transaction

y.test_wkday <- y_wkday[-train_wkday]

grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

cv.out=cv.glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1)
plot(cv.out)

Then we run the lasso model on the test set to see the RMSE.

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkday[-train_wkday,])
#RMSE
lasso_RMSE_wkday <- sqrt(mean((lasso.pred-y.test_wkday)^2))
lasso_RMSE_wkday
## [1] 2.880674
out <- glmnet(x_wkday,y_wkday,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayThu  WeekdayFri  WeekdaySat 
##  3.65314614 -0.08972362 -0.38786168  0.00000000  1.04048780  0.00000000 
##  WeekdaySun Rush_hours1 
##  0.00000000  3.64261787
lasso.coef[lasso.coef!=0]
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayFri Rush_hours1 
##  3.65314614 -0.08972362 -0.38786168  1.04048780  3.64261787

Weekend

x_wkend <- model.matrix(No_of_transaction~.,df_weekend)[,-1]
y_wkend <- df_weekend$No_of_transaction

y.test_wkend <- y_wkend[-train_wkend]

grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

cv.out=cv.glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1)
plot(cv.out)

The test set RMSE of the model chosen by lasso is

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkend[-train_wkend,])
#RMSE
lasso_RMSE_wkend <- sqrt(mean((lasso.pred-y.test_wkend)^2))
lasso_RMSE_wkend
## [1] 4.063711
out <- glmnet(x_wkend,y_wkend,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept)  WeekdayTue  WeekdayWed  WeekdayThu  WeekdayFri  WeekdaySat 
##    3.491346    0.000000    0.000000    0.000000    0.000000    2.214838 
##  WeekdaySun Rush_hours1 
##    0.000000    5.117108
lasso.coef[lasso.coef!=0]
## (Intercept)  WeekdaySat Rush_hours1 
##    3.491346    2.214838    5.117108

4.1.2 Random Forest Model and GBM Model

Random Forest

Secondly, we try another model: random forest regression. We will try several sets of hyperparameters, namely ‘mtry’ and ‘ntree’, to find the best random forest regression model as a baseline.

Weekday

search_grid <- expand.grid(mtry = c(round(num_features/3/2),
                                    round(num_features/3),
                                    round(num_features/3*2)),
                           ntree = c(100, 500, 1000))

min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0

for (i in seq(1, nrow(search_grid))){
  rf_reg_wkday <- randomForest(No_of_transaction~.,
                           df_weekday[train_wkday, ],
                           mtry = search_grid[i, ]$mtry,
                           ntree = search_grid[i, ]$ntree,
                           importance = TRUE
                           )
  if (mean(rf_reg_wkday$mse) < min_RMSE) {
    min_RMSE <- mean(rf_reg_wkday$mse)
    best_mtry <- search_grid[i, ]$mtry
    best_ntree <- search_grid[i, ]$ntree
    
  }

  }

# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkday <- randomForest(No_of_transaction~.,
                           df_weekday[train_wkday, ],
                           mtry = best_mtry,
                           ntree = best_ntree,
                           importance = TRUE
                           )

rf_reg_best_wkday
## 
## Call:
##  randomForest(formula = No_of_transaction ~ ., data = df_weekday[train_wkday,      ], mtry = best_mtry, ntree = best_ntree, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 7.123584
##                     % Var explained: 33.56

Then we train a boosted regression-tree model to compare against the random forest model constructed above.

Weekday

search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
                           interaction.depth = c(4, 5),
                           n.minobsinnode = c(10, 100),
                           bag.fraction = c(0.5, 0.8),
                           optimal_trees = 0,
                           min_RMSE = 0)

for (i in seq(1, nrow(search_grid))){

  boost.fit_wkday <- gbm(No_of_transaction~. ,
                  data = df_weekday[train_wkday, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = search_grid[i, ]$shrinkage,
                  interaction.depth = search_grid[i, ]$interaction.depth,
                  n.minobsinnode = search_grid[i, ]$n.minobsinnode,
                  bag.fraction = search_grid[i, ]$bag.fraction,
                  )
  search_grid[i, ]$min_RMSE <- min(boost.fit_wkday$cv.error)
  search_grid[i, ]$optimal_trees = match(min(boost.fit_wkday$cv.error), boost.fit_wkday$cv.error)

}

# Train and fit this model with the best sets of hyperparameters:

boost_best_para_wkday <- 
  search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]

boost.best_wkday <- gbm(No_of_transaction~. ,
                  data = df_weekday[train_wkday, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = boost_best_para_wkday$shrinkage,
                  interaction.depth = boost_best_para_wkday$interaction.depth,
                  n.minobsinnode = boost_best_para_wkday$n.minobsinnode,
                  bag.fraction = boost_best_para_wkday$bag.fraction,
                  )

boost.best_wkday
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian", 
##     data = df_weekday[train_wkday, ], n.trees = 4000, interaction.depth = boost_best_para_wkday$interaction.depth, 
##     n.minobsinnode = boost_best_para_wkday$n.minobsinnode, shrinkage = boost_best_para_wkday$shrinkage, 
##     bag.fraction = boost_best_para_wkday$bag.fraction, train.fraction = 0.75, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 335.
## The best test-set iteration was 1078.
## There were 9 predictors of which 9 had non-zero influence.

Then we compare the performance of the random forest model and the boosted trees model on the Weekday validation set.

boost_pred_test_wkday <- predict(boost.best_wkday,
                           newdata = df_weekday[-train_wkday, ],
                           n.trees = 4000)
boost_RMSE_test_wkday <- caret::RMSE(boost_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)

rf_pred_test_wkday <- predict(rf_reg_best_wkday,
                        newdata = df_weekday[-train_wkday, ])

rf_RMSE_test_wkday <- caret::RMSE(rf_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)

print(paste0('RMSE of the boosting model on Weekday testing set:', round(boost_RMSE_test_wkday, 4)))
## [1] "RMSE of the boosting model on Weekday testing set:2.9753"
print(paste0('RMSE of the random forest model on Weekday testing set:', round(rf_RMSE_test_wkday, 4)))
## [1] "RMSE of the random forest model on Weekday testing set:2.8462"
print(paste0('RMSE of the lasso regression model on Weekday testing set:', round(lasso_RMSE_wkday, 4)))
## [1] "RMSE of the lasso regression model on Weekday testing set:2.8807"

For the weekday data, it turns out that the random forest model performs best on the test set in terms of RMSE, so we will use the random forest model to explain how different features impact the number of transactions in a given hour on a weekday. Its RMSE on the training set is 2.6818, which does not suggest a risk of overfitting. One thing worth noting, however, is that the random forest explains only about 34% of the variance, probably due to the very limited set of features in our original dataset.
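The training-set RMSE quoted above can be checked against the fitted object; a minimal sketch, assuming `rf_reg_best_wkday` and the split indices are still in scope:

```r
# OOB-based RMSE of the final weekday forest ($mse holds the running
# out-of-bag MSE after each tree is added)
sqrt(tail(rf_reg_best_wkday$mse, 1))

# In-sample RMSE on the training rows (optimistic, since the trees saw them)
rf_pred_train <- predict(rf_reg_best_wkday, newdata = df_weekday[train_wkday, ])
caret::RMSE(rf_pred_train, df_weekday[train_wkday, ]$No_of_transaction)
```

Note that `predict()` on a randomForest object without `newdata` returns the out-of-bag predictions, which is why `newdata` is passed explicitly for the in-sample fit.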

Weekend

search_grid <- expand.grid(mtry = c(round(num_features/3/2),
                                    round(num_features/3),
                                    round(num_features/3*2)),
                           ntree = c(100, 500, 1000))

min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0

for (i in seq(1, nrow(search_grid))){
  rf_reg_wkend <- randomForest(No_of_transaction~.,
                           df_weekend[train_wkend, ],
                           mtry = search_grid[i, ]$mtry,
                           ntree = search_grid[i, ]$ntree,
                           importance = TRUE
                           )
  if (mean(rf_reg_wkend$mse) < min_RMSE) {
    min_RMSE <- mean(rf_reg_wkend$mse)
    best_mtry <- search_grid[i, ]$mtry
    best_ntree <- search_grid[i, ]$ntree
    
  }

  }

# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkend <- randomForest(No_of_transaction~.,
                           df_weekend[train_wkend, ],
                           mtry = best_mtry,
                           ntree = best_ntree,
                           importance = TRUE
                           )

rf_reg_best_wkend
## 
## Call:
##  randomForest(formula = No_of_transaction ~ ., data = df_weekend[train_wkend,      ], mtry = best_mtry, ntree = best_ntree, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 12.72002
##                     % Var explained: 41.38
search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
                           interaction.depth = c(4, 5),
                           n.minobsinnode = c(10, 20),
                           bag.fraction = c(0.5, 0.8),
                           optimal_trees = 0,
                           min_RMSE = 0)

for (i in seq(1, nrow(search_grid))){

  boost.fit_wkend <- gbm(No_of_transaction~. ,
                  data = df_weekend[train_wkend, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = search_grid[i, ]$shrinkage,
                  interaction.depth = search_grid[i, ]$interaction.depth,
                  n.minobsinnode = search_grid[i, ]$n.minobsinnode,
                  bag.fraction = search_grid[i, ]$bag.fraction,
                  )
  search_grid[i, ]$min_RMSE <- min(boost.fit_wkend$cv.error)
  search_grid[i, ]$optimal_trees = match(min(boost.fit_wkend$cv.error), boost.fit_wkend$cv.error)

}

# Train and fit this model with the best sets of hyperparameters:

boost_best_para_wkend <- 
  search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]

boost.best_wkend <- gbm(No_of_transaction~. ,
                  data = df_weekend[train_wkend, ], 
                  distribution = "gaussian",
                  cv.folds = 10,
                  n.trees = 4000,
                  train.fraction = 0.75,
                  n.cores = NULL,
                  verbose = FALSE,
                  shrinkage = boost_best_para_wkend$shrinkage,
                  interaction.depth = boost_best_para_wkend$interaction.depth,
                  n.minobsinnode = boost_best_para_wkend$n.minobsinnode,
                  bag.fraction = boost_best_para_wkend$bag.fraction,
                  )

boost.best_wkend
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian", 
##     data = df_weekend[train_wkend, ], n.trees = 4000, interaction.depth = boost_best_para_wkend$interaction.depth, 
##     n.minobsinnode = boost_best_para_wkend$n.minobsinnode, shrinkage = boost_best_para_wkend$shrinkage, 
##     bag.fraction = boost_best_para_wkend$bag.fraction, train.fraction = 0.75, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 432.
## The best test-set iteration was 265.
## There were 9 predictors of which 8 had non-zero influence.
boost_pred_test_wkend <- predict(boost.best_wkend,
                           newdata = df_weekend[-train_wkend, ],
                           n.trees = 4000)
boost_RMSE_test_wkend <- caret::RMSE(boost_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)

rf_pred_test_wkend <- predict(rf_reg_best_wkend,
                        newdata = df_weekend[-train_wkend, ])

rf_RMSE_test_wkend <- caret::RMSE(rf_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)

print(paste0('RMSE of the boosting model on Weekend testing set:', round(boost_RMSE_test_wkend, 4)))
## [1] "RMSE of the boosting model on Weekend testing set:3.6731"
print(paste0('RMSE of the random forest model on Weekend testing set:', round(rf_RMSE_test_wkend, 4)))
## [1] "RMSE of the random forest model on Weekend testing set:3.5549"
print(paste0('RMSE of the lasso regression model on Weekend testing set:', round(lasso_RMSE_wkend, 4)))
## [1] "RMSE of the lasso regression model on Weekend testing set:4.0637"

As with the weekday data, for the weekend data the selected random forest model performs better than the other models. Its RMSE on the training set is 3.5981, which does not suggest a risk of overfitting. However, only about 41% of the variance is explained by the selected random forest model, again probably because we have very limited features in our original dataset.

4.1.3 Impact of Features

Now that we have the regression model constructed, we would like to further understand the impact of different features.

Weekday

varImpPlot(rf_reg_best_wkday)

The most important factors are the hour when the transaction happened and whether that hour falls within the rush hours - this is not surprising. What is interesting is that the weather-related variables follow right after. We use partial dependence plots to further understand the impact of the weather-related variables.

Weekday

par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'temp')

From the partial dependence plot for Weekday data, we could learn that:

  1. Temperature seems to have a U-shaped effect on the number of transactions: below roughly 2 Celsius, the number of transactions decreases as temperature rises, while above roughly 2 Celsius it increases with temperature. A possible explanation is that many weekday transactions happen during the morning rush hour, when the temperature tends to be low.

  2. When the wind comes from westerly to northerly directions (wind direction roughly 270-350 degrees), sales are higher than when the wind blows from other directions.

  3. Average humidity seems to have a positive effect on the number of transactions; however, when humidity is close to or above 90, which suggests rainy weather, there is a sharp drop, i.e. a negative effect on the number of transactions.

  4. Stronger wind has a negative impact on sales.

To sum up, the most important variable for forecasting the number of transactions at a given time is whether that time falls within the rush hours. Different weather-related variables also affect the number of transactions in different ways - so a practical suggestion to the bakery is to watch the weather forecast and prepare accordingly.
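To make the suggestion concrete, the sketch below scores one hypothetical weekday hour with the fitted model. All forecast values here are made up for illustration, and a training row is copied first so every factor column keeps the levels the model was fitted with:

```r
# Hypothetical forecast for a Friday 8am rush-hour slot (values are illustrative)
new_hour <- df_weekday[train_wkday[1], ]   # copy a row to inherit factor levels
new_hour$Weekday    <- factor("Fri", levels = levels(df_weekday$Weekday))
new_hour$Rush_hours <- factor("1",   levels = levels(df_weekday$Rush_hours))
new_hour$Holiday    <- factor("0",   levels = levels(df_weekday$Holiday))
new_hour$Hour <- 8
new_hour$temp <- 8;   new_hour$rhum <- 75   # forecast: 8 C, 75% humidity
new_hour$wdir <- 300; new_hour$wspd <- 10   # north-westerly, light wind

predict(rf_reg_best_wkday, newdata = new_hour)
```

The predicted count could then be compared against staffing or baking plans for that slot.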

Weekend

varImpPlot(rf_reg_best_wkend)

As in the weekday case, the hour when a transaction happened and whether it falls within the rush hours are the most important features, followed by the weather-related variables.

Weekend

par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'temp')

From the partial dependence plot for Weekend data, we could learn that:

  1. Interestingly, in contrast to the weekday data, temperature no longer has a U-shaped effect; instead, the number of transactions generally increases with temperature. That is probably because no one needs to go out to buy breakfast on a cold Saturday or Sunday morning.

  2. On weekends, westerly to northerly winds (wind direction roughly 270-350 degrees) coincide with a large increase in the number of transactions.

  3. On weekends, the number of transactions generally decreases as humidity increases; indeed, it drops dramatically once humidity exceeds 40. The likely reason for the difference from weekdays is that few people go out when the weather is bad on a weekend.

  4. As on weekdays, wind speed has a negative effect on the number of transactions.

Compared with the weekday case, humidity and temperature affect the number of transactions differently. People have to go to work on weekdays regardless of the weather, but on weekends the weather plays a larger role when people decide whether to visit the bakery.

4.2 Market Basket Analysis

The objective of this analysis is to identify possible association rules. For example, if we learn from the data that people who buy cake tend to buy cookies at the same time, the bakery could consider placing these two items closer together in the shop, or creating bundle offers to boost sales.

We have already re-grouped different items, so for this analysis we will start by looking at the association rules among item types.

transactions_Item_Type <- as(df_merged_mba_item_type$basket, "transactions")
#inspect(transactions_Item_Type[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_Type <- apriori(transactions_Item_Type, 
                 parameter = list(support = 0.05, 
                                  confidence = 0.025,
                                  minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##       0.025    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 246 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9 item(s), 4928 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item_Type <- rules_Item_Type[!is.redundant(rules_Item_Type)]

# Reorder by Lift and display
rules_Item_Type_dt <- data.table( lhs = labels( lhs(rules_Item_Type) ), 
                        rhs = labels( rhs(rules_Item_Type) ), 
                        quality(rules_Item_Type) )[ order(-lift), ]
kable(head(rules_Item_Type_dt, 10))
lhs rhs support confidence coverage lift count
{Bread} {Coffee} 0.2500000 0.5116279 0.4886364 0.8739349 1232
{Coffee} {Bread} 0.2500000 0.4270364 0.5854302 0.8739349 1232
{Cake|Pastry|Sweets} {Coffee} 0.2759740 0.5087916 0.5424107 0.8690902 1360
{Coffee} {Cake|Pastry|Sweets} 0.2759740 0.4714038 0.5854302 0.8690902 1360
{Bread} {Cake|Pastry|Sweets} 0.2297078 0.4700997 0.4886364 0.8666858 1132
{Cake|Pastry|Sweets} {Bread} 0.2297078 0.4234942 0.5424107 0.8666858 1132
{Meal} {Coffee} 0.0951705 0.4885417 0.1948052 0.8345003 469
{Coffee} {Meal} 0.0951705 0.1625650 0.5854302 0.8345003 469
{Hot chocolate|Smoothie|Juice} {Cake|Pastry|Sweets} 0.0651380 0.4439834 0.1467127 0.8185373 321
{Cake|Pastry|Sweets} {Hot chocolate|Smoothie|Juice} 0.0651380 0.1200898 0.5424107 0.8185373 321

From the table we observe that the highest lift among all rules is still smaller than 1. That is, no valid association rules were identified among the item types.
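To see why a lift below 1 rules out a positive association, the metrics of the top rule {Bread} => {Coffee} can be recomputed by hand from the counts in the table above (4,928 transactions in total):

```r
# Recompute support / confidence / lift for {Bread} => {Coffee} from raw counts
n_total  <- 4928   # transactions in the item-type data
n_bread  <- 2408   # transactions containing Bread  (coverage ~0.489)
n_coffee <- 2885   # transactions containing Coffee (support  ~0.585)
n_both   <- 1232   # transactions containing both

support    <- n_both / n_total           # 0.25
confidence <- n_both / n_bread           # ~0.512: P(Coffee | Bread)
lift       <- confidence / (n_coffee / n_total)
lift                                     # ~0.874
```

A lift of exactly 1 would mean the two item types are bought independently; since every rule here sits below 1, conditioning on one item type never raises the chance of seeing another.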

Then we focus our market basket analysis on the detailed list of items instead of the categories. Here we analyze weekdays and weekends separately, as we expect different purchase behaviors.

4.2.1 Weekday Dataset Analysis and Discussion

Weekday

transactions_Item <- as(mba_wkday$basket, "transactions")
#inspect(transactions_Item[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS

rules_Item <- apriori(transactions_Item, 
                 parameter = list(support = 0.005, 
                                  confidence = 0.25,
                                  minlen = 2)
                 )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[85 item(s), 11825 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [115 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item <- rules_Item[!is.redundant(rules_Item)]

# Reorder by Lift and display
rules_Item_dt_wkday <- data.table( lhs = labels( lhs(rules_Item) ), 
                        rhs = labels( rhs(rules_Item) ), 
                        quality(rules_Item) )[ order(-lift), ]

kable(head(rules_Item_dt_wkday, 20))
lhs rhs support confidence coverage lift count
{Coffee,Coke} {Sandwich} 0.0059197 0.4347826 0.0136152 3.897880 70
{Juice,Tea} {Cookies} 0.0053277 0.3519553 0.0151374 3.735971 63
{Coke} {Sandwich} 0.0109937 0.3735632 0.0294292 3.349041 130
{Mineral water} {Sandwich} 0.0085412 0.3519164 0.0242706 3.154974 101
{Coffee,Truffles} {Sandwich} 0.0050740 0.3260870 0.0155603 2.923410 60
{Coffee,Juice} {Cookies} 0.0105708 0.2642706 0.0400000 2.805207 125
{Cake,Soup} {Tea} 0.0055814 0.5689655 0.0098097 2.794027 66
{Coffee,Soup} {Sandwich} 0.0085412 0.2869318 0.0297674 2.572380 101
{Bread,Soup} {Tea} 0.0054123 0.3950617 0.0136998 1.940035 64
{Soup,Tea} {Cake} 0.0055814 0.2738589 0.0203805 1.935674 66
{Coffee,Juice} {Cake} 0.0103171 0.2579281 0.0400000 1.823072 122
{Spanish Brunch} {Tea} 0.0076956 0.3625498 0.0212262 1.780379 91
{Chicken Stew} {Tea} 0.0078647 0.3419118 0.0230021 1.679031 93
{Scone} {Tea} 0.0098943 0.3371758 0.0293446 1.655774 117
{Soup} {Tea} 0.0203805 0.3370629 0.0604651 1.655220 241
{Cookies,Juice} {Tea} 0.0053277 0.3150000 0.0169133 1.546875 63
{Extra Salami or Feta} {Coffee} 0.0054968 0.8552632 0.0064271 1.524263 65
{Tiffin} {Tea} 0.0064271 0.3102041 0.0207188 1.523324 76
{Keeping It Local} {Coffee} 0.0085412 0.8145161 0.0104863 1.451643 101
{Cake} {Tea} 0.0401691 0.2839211 0.1414799 1.394255 475
subrules2 <- head(sort(rules_Item, by = "lift"), 20)
# Plot the top-20 rules as a graph; item labels are drawn by default
ig <- plot(subrules2, method = "graph")

A few observations can be made from the table:

During weekdays

  1. People who buy drinks, such as coffee, juice, tea, and mineral water, tend to also buy a sandwich

  2. People who buy coffee and juice, or juice and tea, tend to buy cookies. Normally a client would not buy two drinks for him- or herself alone, so we suspect these are people buying drinks and cookies to share with friends.

  3. People who buy lunch items, such as cake and soup, bread and soup, Spanish brunch, chicken stew, or soup alone, tend to buy tea at the same time.

  4. People who buy sweets, such as scone and cake, tend to buy tea at the same time.

To explore finding 4 further, we plot the distribution of transactions including scone or cake by hour of the day.

p1<-mba_wkday%>%filter(Item=='Scone')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Scone by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p2<-mba_wkday%>%filter(Item=='Cake')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cake by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)


grid.arrange(p1, p2,
             nrow=1,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours",
                            gp=gpar(fontsize=14,font=3)))

It turns out that scones and cakes are sold more in the afternoon. The association rules between scone, cake, and tea probably reflect afternoon snacking.

Based on the above, we could recommend that the bakery, on weekdays:

  1. Create bundle deals for drinks and sandwiches

  2. Create bundle deals for multiple drinks and cookies or cakes

  3. Create bundle deals for lunch and tea

  4. Create bundle deals for afternoon snacks and tea

4.2.2 Weekend Dataset Analysis and Discussion

Weekend

transactions_Item <- as(mba_wkend$basket, "transactions")
#inspect(transactions_Item[1])

# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_wkend <- apriori(transactions_Item, 
                 parameter = list(support = 0.01, 
                                  confidence = 0.25,
                                  minlen = 2)
                 )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 70 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 7047 transaction(s)] done [0.00s].
## sorting and recoding items ... [34 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [84 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules
rules_Item_wkend <- rules_Item_wkend[!is.redundant(rules_Item_wkend)]

# Reorder by Lift and display
rules_Item_dt_wkend <- data.table( lhs = labels( lhs(rules_Item_wkend) ), 
                        rhs = labels( rhs(rules_Item_wkend) ), 
                        quality(rules_Item_wkend) )[ order(-lift), ]

kable(head(rules_Item_dt_wkend, 10))
lhs rhs support confidence coverage lift count
{Farm House} {Medialuna} 0.0102171 0.2599278 0.0393075 2.351362 72
{Jammie Dodgers} {Cake} 0.0103590 0.3273543 0.0316447 1.909657 73
{Coffee,Tea} {Cake} 0.0261104 0.2813456 0.0928054 1.641260 184
{Coffee,Hot chocolate} {Cake} 0.0184476 0.2771855 0.0665531 1.616992 130
{Bread,Tea} {Cake} 0.0129133 0.2684366 0.0481056 1.565954 91
{Cake} {Tea} 0.0488151 0.2847682 0.1714205 1.530711 344
{Tea} {Cake} 0.0488151 0.2623951 0.1860366 1.530711 344
{Hot chocolate} {Cake} 0.0278133 0.2525773 0.1101178 1.473437 196
{Scone} {Tea} 0.0224209 0.2585925 0.0867036 1.390008 158
{Hot chocolate,Pastry} {Coffee} 0.0100752 0.7553191 0.0133390 1.335692 71
# Graph for 10 rules - Weekend
subrules2 <- head(sort(rules_Item_wkend, by = "lift"), 10)
# As before, the deprecated control=list(type="items") argument is dropped.
ig <- plot(subrules2, method = "graph")

On weekends, the association rules concentrate on a few items, including cake, tea, coffee, hot chocolate, and scone. One thing worth noting is that cake sits at the center of the graph. In other words, people tend to buy cake when they are purchasing a bunch of other items. The association rules also suggest that people sometimes buy items to share with others, as there are transactions including both coffee and tea, or coffee and hot chocolate.
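Cake's central role can also be checked directly by subsetting the rules that mention it on either side. The following is a sketch using `arules::subset` on the `rules_Item_wkend` object fitted above:

```r
# Sketch: extract every weekend rule that involves Cake on either side
# and inspect the strongest ones by lift.
cake_rules <- subset(rules_Item_wkend, lhs %in% "Cake" | rhs %in% "Cake")
inspect(head(sort(cake_rules, by = "lift"), 5))
```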

Since cake is at the center of the associations, we are curious when cake sales peak and how the sales of the different drinks associated with cake vary by hour of the day.

p3<-mba_wkend%>%filter(Item=='Cake')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Cake by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p4<-mba_wkend%>%filter(Item=='Coffee')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Coffee by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p5<-mba_wkend%>%filter(Item=='Tea')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Tea by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

p6<-mba_wkend%>%filter(Item=='Hot chocolate')%>%
  group_by(as.factor(Hour))%>%
  mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
  ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
  geom_bar(stat = 'identity',alpha=0.8) +
  labs(title = 'Average Number of Transactions of Hot chocolate by Hour',
       x = 'Hours', y = 'Avg No. of Transaction') +
  scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
  theme_minimal() +
  theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)

grid.arrange(p3,p4, p5,p6,
             nrow=2,ncol=2,
             top = textGrob("Average Transaction Frequency by Hours",
                            gp=gpar(fontsize=14,font=3)))
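The four near-identical chunks above could also be factored into a single helper. This is a sketch assuming the `mba_wkend` data frame and the `nb.cols` palette size defined earlier:

```r
# Sketch: one helper replacing the four copy-pasted plotting chunks.
plot_hourly_share <- function(data, item) {
  data %>%
    filter(Item == item) %>%
    group_by(Hour = as.factor(Hour)) %>%
    mutate(Avg_No_of_transaction =
             length(unique(Transaction)) / length(unique(data$Transaction))) %>%
    ggplot(aes(x = Hour, y = Avg_No_of_transaction, fill = Hour)) +
    geom_bar(stat = 'identity', alpha = 0.8) +
    labs(title = paste('Average Number of Transactions of', item, 'by Hour'),
         x = 'Hours', y = 'Avg No. of Transaction') +
    scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
    theme_minimal() +
    theme(legend.position = "none",
          plot.title = element_text(size = 8),
          axis.title.x = element_text(size = 8),
          axis.title.y = element_text(size = 8))
}
# e.g. grid.arrange(plot_hourly_share(mba_wkend, 'Cake'),
#                   plot_hourly_share(mba_wkend, 'Coffee'), ncol = 2)
```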

From the figures above, coffee and hot chocolate are sold more in the morning, as opposed to tea. Sales of cake peak in the afternoon. It is likely that on weekends, people hang out together at the bakery for cake and multiple drinks. The bakery could offer a special weekend afternoon combo of cake and drinks.

Based on these observations, our recommendations to the bakery would be:

  1. During weekends, place cakes in a visible spot close to the cashier

  2. Offer a combo of multiple drinks and cake

4.3 Clustering Analysis and Discussion

In this section, we explore whether the transactions form distinct clusters.

4.3.1 Clustering on Weekday Dataset

Weekday

library(factoextra) # provides fviz_nbclust() and fviz_cluster()

cluster_wkday_scale = scale(cluster_wkday)

fviz_nbclust(
  cluster_wkday_scale,
  kmeans,
  k.max = 10,
  method = "wss"
) + ggtitle ("Elbow Method")

fviz_nbclust(
  cluster_wkday_scale,
  kmeans,
  k.max = 10,
  method = "silhouette"
) + ggtitle ("Silhouette Method")

Both figures suggest that the best K is 6. However, after trying several different values of K, we decided to go with K = 4.
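As a quick sanity check on the choice of K, one could compare the total within-cluster sum of squares for several candidate values directly (a sketch, assuming the scaled matrix from the chunk above):

```r
# Sketch: total within-cluster sum of squares for candidate K,
# complementing the elbow plot above with exact numbers.
set.seed(0)
sapply(2:6, function(k)
  kmeans(cluster_wkday_scale, centers = k, nstart = 25)$tot.withinss)
```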

set.seed(0)
km_out_2 <- kmeans(cluster_wkday_scale, 4, nstart = 25)

fviz_cluster(km_out_2, cluster_wkday_scale, geom = 'point', ellipse.type = 'norm',
             main = 'Weekday Dataset: K-Means Clustering Results with K=4') + 
    scale_fill_manual(values = cookiecol[c(7,5,1,4)]) +
  scale_color_manual(values = cookiecol[c(7,5,1,4)]) + 
  theme_minimal()

Using PCA, we can visualize the clusters in a 2-D plane. Then, to understand how well we split the transactions into four clusters, we compare the cluster centers.

library('DMwR') # provides unscale() to map the scaled centers back to original units
unscale(km_out_2$centers, cluster_wkday_scale)
##   Rush_hours Holiday Item_Type     temp     rhum     wdir     wspd     coco
## 1   1.000000       1  3.926362 7.379971 77.18851 216.1267 16.18395 1.319588
## 2   1.981895       1  4.452211 9.124211 73.58316 236.4000 21.98800 1.267368
## 3   1.869395       1  3.942265 4.073991 87.50224 177.8363 10.30409 1.168722
## 4   1.882759       2  4.400000 8.579310 79.96552 234.8276 22.35103 1.027586
##       Hour
## 1 16.45066
## 2 12.54063
## 3 10.92265
## 4 12.73103

In general, one of the clusters contains only the non-rush-hour transactions. The remaining clusters seem to be split based on holidays and weather conditions, including humidity, wind direction, and wind speed.
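A quick tabulation can back up this reading of the centers (a sketch, assuming the `km_out_2` fit and `cluster_wkday` data frame above):

```r
# Sketch: cluster sizes and the holiday composition of each cluster.
km_out_2$size
table(Cluster = km_out_2$cluster, Holiday = cluster_wkday$Holiday)
```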

cluster_wkday_cbind<-cbind(km_out_2$cluster, cluster_wkday)
cluster_wkday1<-cluster_wkday_cbind%>%filter(km_out_2$cluster==1)
cluster_wkday2<-cluster_wkday_cbind%>%filter(km_out_2$cluster==2)
cluster_wkday3<-cluster_wkday_cbind%>%filter(km_out_2$cluster==3)
cluster_wkday4<-cluster_wkday_cbind%>%filter(km_out_2$cluster==4)
names(cluster_wkday_cbind)[1]<-'cluster_no'
cluster_wkday_cbind$cluster_no<-as.factor(cluster_wkday_cbind$cluster_no)
cluster_wkday_cbind %>%
  ggplot( aes(x=as.factor(Item_Type),fill=cluster_no)) + facet_wrap(~ cluster_no) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity') +
    scale_fill_brewer(palette = 'BrBG') +
    theme_minimal() 

From the graph above, cluster 2 has the highest number of transactions while cluster 3 has the lowest, which is explained by the holiday data in cluster 3. The numbers of transactions in clusters 1, 2, and 4, which all contain non-holiday data, are still quite different. Let's look at the details of the other features in these clusters.

p1<-cluster_wkday1 %>%
  ggplot(aes(x=as.factor(Item_Type))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
    theme_minimal() 
p2<-cluster_wkday1 %>%
  ggplot(aes(x=as.factor(Rush_hours))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
    theme_minimal() 
p3<-cluster_wkday1 %>%
  ggplot(aes(x=as.factor(Holiday))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
    theme_minimal() 
p4<-cluster_wkday1 %>%
 ggplot(aes(x=temp)) +
    geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
    theme_minimal() 
p5<-cluster_wkday1 %>%
 ggplot(aes(x=wspd)) +
    geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
    theme_minimal() 
p6<-cluster_wkday1 %>%
 ggplot(aes(x=rhum)) +
    geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
    theme_minimal() 
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)

Cluster 1 contains non-rush-hour transactions on non-holidays, with a mean temperature of around 7, relatively low wind speed, high humidity, and a low number of transactions across item types.

p1<-cluster_wkday2 %>%
  ggplot(aes(x=as.factor(Item_Type))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
    theme_minimal() 
p2<-cluster_wkday2 %>%
  ggplot(aes(x=as.factor(Rush_hours))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
    theme_minimal() 
p3<-cluster_wkday2 %>%
  ggplot(aes(x=as.factor(Holiday))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
    theme_minimal() 
p4<-cluster_wkday2 %>%
 ggplot(aes(x=temp)) +
    geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
    theme_minimal() 
p5<-cluster_wkday2 %>%
 ggplot(aes(x=wspd)) +
    geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
    theme_minimal() 
p6<-cluster_wkday2 %>%
 ggplot(aes(x=rhum)) +
    geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
    theme_minimal() 
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)

Cluster 2 contains rush-hour transactions on non-holidays, with relatively high temperature, relatively high wind speed, relatively low humidity, and a relatively high number of transactions across item types.

p1<-cluster_wkday3 %>%
  ggplot(aes(x=as.factor(Item_Type))) +
    geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
    theme_minimal() 
p2<-cluster_wkday3 %>%
  ggplot(aes(x=as.factor(Rush_hours))) +
    geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
    theme_minimal() 
p3<-cluster_wkday3 %>%
  ggplot(aes(x=as.factor(Holiday))) +
    geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
    theme_minimal() 
p4<-cluster_wkday3 %>%
 ggplot(aes(x=temp)) +
    geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
    theme_minimal() 
p5<-cluster_wkday3 %>%
 ggplot(aes(x=wspd)) +
    geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
    theme_minimal() 
p6<-cluster_wkday3 %>%
 ggplot(aes(x=rhum)) +
    geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
    theme_minimal() 
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)

Cluster 3 contains holiday transactions, which show an even lower number of transactions compared with cluster 1 (non-rush hours on non-holidays), even though its temperature is relatively higher and its humidity relatively lower.

p1<-cluster_wkday4 %>%
  ggplot(aes(x=as.factor(Item_Type))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
    theme_minimal() 
p2<-cluster_wkday4 %>%
  ggplot(aes(x=as.factor(Rush_hours))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
    theme_minimal() 
p3<-cluster_wkday4 %>%
  ggplot(aes(x=as.factor(Holiday))) +
    geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
    theme_minimal() 
p4<-cluster_wkday4 %>%
 ggplot(aes(x=temp)) +
    geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
    theme_minimal() 
p5<-cluster_wkday4 %>%
 ggplot(aes(x=wspd)) +
    geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
    theme_minimal() 
p6<-cluster_wkday4 %>%
 ggplot(aes(x=rhum)) +
    geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
    theme_minimal() 
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)

Cluster 4 represents (mostly) rush hours on non-holidays, with a mean temperature close to 5, relatively low wind speed, relatively high humidity, and a relatively high number of transactions across item types.

In conclusion:

Since all clusters except cluster 1 overlap considerably, we can obtain only limited information from the clustering results for now. We can only say that a lower number of transactions tends to occur on holidays than on non-holidays.

4.3.2 Clustering on Weekend Dataset

Weekend

cluster_wkend_scale = scale(cluster_wkend)

fviz_nbclust(
  cluster_wkend_scale,
  kmeans,
  k.max = 10,
  method = "wss"
) + ggtitle ("Elbow Method")

fviz_nbclust(
  cluster_wkend_scale,
  kmeans,
  k.max = 10,
  method = "silhouette"
) + ggtitle ("Silhouette Method")

Both figures suggest that the best K is 8. However, after trying several different values of K, we selected K = 3.

set.seed(0)
km_out_2 <- kmeans(cluster_wkend_scale, 3, nstart = 25)

fviz_cluster(km_out_2, cluster_wkend_scale, geom = 'point', ellipse.type = 'norm',
             main = 'Weekend Dataset: K-Means Clustering Results with K=3') + 
  scale_fill_manual(values = cookiecol[c(5,1,7)]) +
  scale_color_manual(values = cookiecol[c(5,1,7)]) + 
  theme_minimal()

Then, to understand how well we split the transactions into three clusters, we compare the cluster centers.

unscale(km_out_2$centers, cluster_wkend_scale)
##   Rush_hours Holiday Item_Type     temp     rhum     wdir     wspd     coco
## 1   1.911582       1  4.379826 4.418431 89.62391 161.5068 13.04633 1.724782
## 2   1.815981       1  4.641646 9.690880 75.37772 249.4592 17.01324 1.045198
## 3   1.888889       2  4.311111 7.911111 78.13333 245.0000 36.68000 1.688889
##       Hour
## 1 11.59651
## 2 13.07345
## 3 12.06667
cluster_wkend_cbind<-cbind(km_out_2$cluster, cluster_wkend)
cluster_wkend1<-cluster_wkend_cbind%>%filter(km_out_2$cluster==1)
cluster_wkend2<-cluster_wkend_cbind%>%filter(km_out_2$cluster==2)
cluster_wkend3<-cluster_wkend_cbind%>%filter(km_out_2$cluster==3)
names(cluster_wkend_cbind)[1]<-'cluster_no'
cluster_wkend_cbind$cluster_no<-as.factor(cluster_wkend_cbind$cluster_no)
cluster_wkend_cbind %>%
  ggplot( aes(x=as.factor(Item_Type),fill=cluster_no)) + facet_wrap(~ cluster_no) +
    geom_bar(stat = 'count',alpha=0.8, position = 'identity') +
    scale_fill_manual(values = cookiecol[c(5,1,7)]) +
    xlab('Item Type') + 
    ylab('Number of Transactions') + 
    theme_minimal() 

From the graph above, cluster 2 has the highest number of transactions while cluster 3 has the lowest, which is explained by the holiday data in cluster 3. The numbers of transactions in clusters 1 and 2, which both contain non-holiday data, are still quite different. Let's look at the details of the other features in these two clusters.

cluster_wkend_cbind$cluster_no<-as.factor(cluster_wkend_cbind$cluster_no)
p1<-cluster_wkend_cbind %>%
  filter(cluster_no%in%c(1,2)) %>%
  ggplot(aes(x=Hour, fill=cluster_no)) + 
    geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
    scale_fill_manual(values = cookiecol[c(5,1)]) +
    theme_minimal() +
    labs(fill="") 

p2<-cluster_wkend_cbind %>%
  filter(cluster_no%in%c(1,2)) %>%
  ggplot(aes(x=temp, fill=cluster_no)) + 
    geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
    scale_fill_manual(values = cookiecol[c(5,1)]) +
    theme_minimal() +
    labs(fill="") +
    xlab('Temperature') 

p3<-cluster_wkend_cbind %>%
  filter(cluster_no%in%c(1,2)) %>%
  ggplot(aes(x=wspd, fill=cluster_no)) + 
    geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
    scale_fill_manual(values = cookiecol[c(5,1)]) +
    theme_minimal() +
    labs(fill="") +
    xlab('Wind Speed') 

p4<-cluster_wkend_cbind %>%
  filter(cluster_no%in%c(1,2)) %>%
  ggplot(aes(x=rhum, fill=cluster_no)) + 
    geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
    scale_fill_manual(values = cookiecol[c(5,1)]) +
    theme_minimal() +
    labs(fill="") +
    xlab('Humidity') 

grid.arrange(p1,p2,p3,p4,nrow=2,ncol=2)

Cluster 2 tends to contain afternoon transactions, with relatively higher temperature, higher wind speed, and relatively lower humidity. However, there is still a lot of overlap between the two clusters, so only limited information can be derived from the clustering results for now.
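One way to quantify this overlap is the average silhouette width, where values near 0 indicate poorly separated clusters. This is a sketch assuming the weekend `km_out_2` fit and `cluster_wkend_scale` matrix above:

```r
# Sketch: average silhouette width for the weekend K = 3 solution.
library(cluster) # provides silhouette(); ships with base R
sil <- silhouette(km_out_2$cluster, dist(cluster_wkend_scale))
summary(sil)$avg.width
```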

5 Conclusion & Next Steps

In a nutshell, with the bakery transaction data and weather records, we first separated weekdays from weekends. We then explored how different features influence the number of transactions, especially the weather-related variables. After that, we dived into the association rules from the transaction history, identified purchasing patterns, and provided recommendations for the bakery to further increase sales. Finally, we clustered the transactions into different segments and discussed the initial results.

The major next steps of this analysis are:

  1. When predicting the number of transactions, the selected random forest model could explain only 30% - 40% of the variance. We would need either to acquire more data beyond Oct 2016 - Apr 2017 or to engineer new features. Our current features only include date, time, item name, and weather. If we could obtain information about the clients, we could further improve the model's predictions.

  2. The association rules we identified have low support in general. With more data, we might be able to further explore the possible rules.

  3. We could dive deeper into the clustering analysis to further explore the similarities of transactions within the same cluster. With more features, such as demographic information about the clients, we could extend the clustering analysis and eventually turn the findings into actionable recommendations for the bakery.

  4. Whether a day is a public holiday is not significant in any of our analyses. Part of the reason is that our dataset spans only six months. With more data, covering several years, we could further explore the impact of holidays on the bakery's sales.